Authors


Hector R. Gavilanes, Chief Information Officer
Gail Han, Chief Operating Officer
Michael T. Mezzano, Chief Technology Officer


University of West Florida

November 2023

Agenda

  • Introduction
  • Method
  • Example
  • Application
  • Conclusion

Principal Component Analysis (PCA)

  • Unsupervised Machine Learning
  • Dimensionality Reduction Technique
  • Data Exploration
  • Feature Extraction
  • Data Visualization
  • Simplification of complex datasets
  • Principal Components (PCs): capture the variance of the original variables
  • Mitigate multicollinearity

Assumptions and Limitations

  • Sensitivity to the scaling of the data
  • Loss of interpretability in transformed features.
  • Loss of Information

Why Use PCA?

  • Reducing Dimensionality: Simplify high-dimensional data.
  • Visualizing Data: Help visualize data in lower dimensions.
  • Noise Reduction: Eliminate less relevant features.
  • Improved Model Performance: Enhance machine learning efficiency.

Dimensionality Reduction

  • Unsupervised Learning.
  • Reduce Dimensions: Transform data by multiplying with selected eigenvectors.
  • New Feature Space: Data exists in a lower-dimensional feature space.
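
A minimal sketch of this transformation in R, assuming Z is an already standardized numeric data matrix (mtcars is used below purely as a stand-in):

Z <- scale(as.matrix(mtcars))        # stand-in standardized data matrix
V <- eigen(cov(Z))$vectors           # eigenvectors of its covariance matrix
Z_new <- Z %*% V[, 1:2]              # multiply by the selected eigenvectors
dim(Z_new)                           # the data now lives in a 2-dimensional feature space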

Visualization

  • Data Projection: Visualize data in the reduced feature space.
  • Scatterplots: Use scatterplots to visualize data distribution.

Methods

  • Data matrix \(X\) of size \(N\) x \(P\).
  • Data is linearly related.
  • Continuous and normally distributed data.
    • In practice, the initial data distribution does not strictly matter.
  • Variables are similar in scale and without extreme outliers.
  • Missing data: Imputation or removal of observations.
  • Centering and scaling: Transform variables to a mean of 0 and a standard deviation of 1. \[ z_{np} = \frac{x_{np} - \bar{x}_{p}}{\sigma_{p}} \]
  • Covariance: A measure of how two random variables vary together. \[ Cov(x,y) = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{N} \]
  • Covariance Matrix: Symmetric \(p \times p\) matrix which gives the covariance values for each pair of variables in the dataset.
  • Eigenvector: a nonzero vector whose direction is unaffected by a linear transformation.
  • Under the transformation, an eigenvector is scaled by a factor \(\lambda\), the eigenvalue.
  • Each principal component is given by the eigenvectors of the covariance matrix.
    • The eigenvectors represent the directions of the new principal axes.
    • The eigenvalues represent the magnitude of these eigenvectors.
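
These steps translate into a few lines of R; the iris measurements are used here only as a stand-in for a data matrix \(X\):

X <- as.matrix(iris[, 1:4])          # N x P data matrix (stand-in)
Z <- scale(X)                        # center and scale: z = (x - mean) / sd per column
S <- cov(Z)                          # symmetric P x P covariance matrix
eig <- eigen(S)                      # eigendecomposition of S
eig$vectors                          # directions of the new principal axes
eig$values                           # variance (magnitude) along each axis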

Finding the Principal Components

  • Find the linear combination of the columns of \(X\) (the variables) which maximizes variance.
  • Let \(a\) be a vector of constants \(a_1, a_2, a_3, …, a_p\) such that \(Xa\) represents the linear combination which maximizes variance.
  • The variance of \(Xa\) is represented by \(var(Xa) = a^TSa\) with the covariance matrix \(S\).
  • Finding the \(Xa\) with maximum variance equates to finding the vector \(a\) which maximizes the quadratic \(a^TSa\), where \(a^Ta = 1\).
  • \(a\) is a unit-norm eigenvector with eigenvalue \(\lambda\) of the covariance matrix \(S\).
  • The largest eigenvalue of \(S\) is \(\lambda_1\), with corresponding eigenvector \(a_1\). For any unit-norm eigenvector \(a\): \[ var(Xa) = a^TSa = \lambda a^Ta = \lambda \]
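
A quick numeric check of this identity on stand-in data (the iris measurements again, an assumption for illustration only):

Z   <- scale(as.matrix(iris[, 1:4]))  # standardized stand-in data
eig <- eigen(cov(Z))
a1  <- eig$vectors[, 1]               # unit-norm eigenvector of the largest eigenvalue
var(Z %*% a1)                         # variance of the linear combination Xa_1
eig$values[1]                         # lambda_1: matches up to floating-point error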

Principal Components

  • Impose the restriction of orthogonality on the coefficient vectors \(a_k\).
    • Ensure the principal components are uncorrelated.
  • The eigenvectors of \(S\) represent the solutions to finding \(Xa_k\) which maximize variance while minimizing correlation with prior linear combinations.
  • Each \(Xa_k\) is a principal component of the dataset, with eigenvector \(a_k\) and eigenvalue \(\lambda_k\).
  • Factor scores: the elements of \(Xa_k\); they measure how each observation scores on a PC.
    • In a geometric interpretation of PCA, the factor scores measure length (magnitude) on the Cartesian plane.
    • This length represents the projection of the original observations onto the PCs from the origin at \((0, 0)\).
  • Loadings: the elements of the eigenvectors \(a_k\); they represent the weights of the original variables in the computation of the PCs.
    • The loadings give the correlation, from -1 to 1, of each variable with the factor scores.
  • Eigenvectors: Represent directions of maximum variance.
  • Eigenvalues: Indicate the variance explained by each eigenvector.
  • Sorting: Sort eigenvalues in descending order to select the most significant principal components.
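
Continuing the stand-in example, the factor scores, loadings, and explained-variance ordering can be computed directly; eigen() already returns eigenvalues in descending order:

Z   <- scale(as.matrix(iris[, 1:4]))       # standardized stand-in data
eig <- eigen(cov(Z))                       # eigenvalues come back sorted, largest first
loadings <- eig$vectors                    # columns a_k: weights of the original variables
scores   <- Z %*% loadings                 # factor scores: how each observation scores on each PC
round(eig$values / sum(eig$values), 3)     # proportion of variance explained by each PC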

Example

  • For this example of PCA, the Abalone dataset from the UCI Machine Learning Repository is used.
  • This dataset contains 4,177 observations of 9 variables recording characteristics of each abalone, including sex, length, diameter, height, several weight measurements, and the number of rings.
  • The variables, apart from sex, are continuous and correlated.

Preprocessing the data

  • Exclude non-numeric variables from the dataset.
    • The variable Sex is excluded.
  • Check for missing data.
    • No missing data in the dataset.
  • Scale and center the data.
  • Check for and handle extreme outliers.
    • Outliers do not present a large problem.
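
A sketch of these steps in R, assuming the Abalone data has been read into a data frame named abalone (the UCI file has no header row, so the column names here are assumed):

abalone_num <- abalone[, names(abalone) != "Sex"]    # drop the non-numeric variable
sum(is.na(abalone_num))                              # check for missing data (none expected)
abalone_scaled <- scale(abalone_num)                 # center to mean 0 and scale to sd 1
boxplot(abalone_scaled, las = 2)                     # quick visual check for extreme outliers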

Perform Principal Component Analysis

The prcomp() function performs principal component analysis via a singular value decomposition of the centered (and optionally scaled) data matrix, rather than an eigendecomposition of the covariance matrix, which is generally preferred for numerical accuracy.

  • The standard deviation for each PC represents the information captured by that principal component.
  • The proportion of variance is the percent of total variance captured by each PC.
  • The cumulative proportion gives the total variance captured by the PC and all prior PCs.
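
A sketch of the call and its summary output, assuming abalone_num is the numeric Abalone data from the preprocessing step above:

pca_abalone <- prcomp(abalone_num, center = TRUE, scale. = TRUE)
summary(pca_abalone)   # standard deviation, proportion of variance, and cumulative proportion per PC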

Visualizing the results

Interpreting the results

  • The loadings of the first two principal components show the contribution of each variable to PC1 and PC2.
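
For example, assuming the pca_abalone object from the sketch above, the loadings of the first two PCs and a quick biplot can be inspected with:

pca_abalone$rotation[, 1:2]   # loadings of each variable on PC1 and PC2
biplot(pca_abalone)           # scores and loadings in the plane of the first two PCs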

Applications of PCA

  • Image Compression: Reduce image size while preserving details.
  • Face Recognition: Reduce facial feature dimensions for classification.
  • Anomaly Detection: Identify anomalies in large datasets.
  • Bioinformatics: Analyze gene expression data.

Dataset

  • Data collected from 50 US states + 6 U.S. territories
  • 39 variables
    • 24 measures of patient care quality in dialysis facilities
    • 14 characteristics of dialysis patients
    • 1 categorical index variable, which was removed

Dataset Summary

Dataset Selection Rationale

  • Selection driven by multicollinearity.

  • Some features are less significant in explaining variability.

  • All variables are numeric.

  • One categorical Index variable.

Data Preparation

  • Efficient removal of white spaces in the dataset.

  • Editing variable names to enhance readability and meaningfulness.

Original: “Percentage.Of.Adult..Patients.With.Hypercalcemia..Serum.Calcium.Greater.Than.10.2.Mg.dL.”

Edited: “hypercalcemia_calcium > 10.2Mg.”
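
A sketch of this cleanup in R, assuming the data has been read into a data frame named dialysis (the name and the exact renaming are illustrative):

char_cols <- sapply(dialysis, is.character)                  # locate character columns
dialysis[char_cols] <- lapply(dialysis[char_cols], trimws)   # strip stray white space
# Shorten the unwieldy variable name shown above
names(dialysis)[names(dialysis) ==
  "Percentage.Of.Adult..Patients.With.Hypercalcemia..Serum.Calcium.Greater.Than.10.2.Mg.dL."] <-
  "hypercalcemia_calcium > 10.2Mg."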

Missing Values

  • 34 missing values.

  • Imputation of missing values using the mean (\(\mu\))
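
A minimal sketch of mean imputation over the numeric columns (the dialysis data frame name is assumed):

num_cols <- sapply(dialysis, is.numeric)
dialysis[num_cols] <- lapply(dialysis[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)   # replace each NA with the column mean
  x
})
sum(is.na(dialysis))                     # should now report 0 missing values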

Distribution

  • Normality is not assumed.

QQ-Plot of Residuals

  • Outliers are present throughout the entire dataset

Standardization

  • Mean (\(\mu = 0\)); Standard Deviation (\(\sigma = 1\))

    \[ Z = \frac{{ x - \mu }}{{ \sigma }} \]

    \[ Z \sim N(0,1) \]
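
In R this standardization is a single call; a quick check on the assumed dialysis data frame:

num_cols <- sapply(dialysis, is.numeric)                     # numeric columns only
Z <- scale(dialysis[num_cols], center = TRUE, scale = TRUE)
round(colMeans(Z), 10)                                       # column means are approximately 0
apply(Z, 2, sd)                                              # column standard deviations are 1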

Outliers & Leverage

  • 3 Outliers

  • No leverage

  • Minimal difference.

  • No observations removed.

Correlations

  • Multicollinearity is present.
  • Threshold = 0.30.
  • 28 Correlated features.
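
A sketch of how such counts can be obtained, assuming the dialysis data frame and an absolute-correlation threshold of 0.30:

corr_mat <- cor(dialysis[sapply(dialysis, is.numeric)])   # correlations of the numeric variables
sum(abs(corr_mat) > 0.30 & upper.tri(corr_mat))           # variable pairs above the 0.30 threshold
diag(corr_mat) <- 0                                       # ignore self-correlations
sum(apply(abs(corr_mat) > 0.30, 1, any))                  # variables with at least one high correlation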

Scree Plot

  • PC1 explains 40.8% variance.
  • PC2 explains 9.5% variance.
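
A scree plot can be drawn from the prcomp() output in base R (pca_obj is an assumed name for the fitted PCA object):

var_explained <- pca_obj$sdev^2 / sum(pca_obj$sdev^2)        # proportion of variance per PC
barplot(100 * var_explained,
        names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Variance explained (%)", las = 2)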

BiPlot

  • PC1, shown in black, displays the longest projection distance.
  • PC2, shown in blue, displays a shorter distance, as expected.

Contribution of Variables

PCA in Machine Learning

  • Feature Extraction: Use relevant features to create principal components.
  • Preprocessing: Standardize or normalize data before applying PCA.
  • Model Training: Enhance model performance by reducing dimensionality.

Modeling

# Required packages: caTools for sample.split(), caret for preProcess()
library(caTools)
library(caret)

# Reproducible random sampling
set.seed(my_seed)

# Target variable used to stratify the train/test split
y <- train_data$expected_survival
# Split the data into training (70%) and test (30%) sets
split <- sample.split(y, SplitRatio = 0.7)
training_set <- subset(train_data, split == TRUE)
test_set <- subset(train_data, split == FALSE)

# Center and scale the predictors of the training and test sets
sc <- preProcess(training_set[, -target_index],
                 method = c("center", "scale"))
training_set[, -target_index] <- predict(
  sc, training_set[, -target_index])
test_set[, -target_index] <- predict(sc, test_set[, -target_index])

# Fit Principal Component Analysis (PCA) preprocessing on the training data
pca <- preProcess(training_set[, -target_index],
                  method = 'pca', pcaComp = 8)

# Apply the PCA transformation to the original training set
training_set <- predict(pca, training_set)

# Reorder columns, moving the dependent variable to the end
training_set <- training_set[c(2:9, 1)]

# Apply the PCA transformation to the original test set
test_set <- predict(pca, test_set)

# Reorder columns, moving the dependent variable to the end
test_set <- test_set[c(2:9, 1)]

Uncorrelated Matrix

8 Principal Components

PC Regression

8 Components

2 Components

Cross-Validation Model

# Cross-validation with n folds
k_10 <- trainControl(method = "cv", number = 10)

# Train the linear model with 10-fold cross-validation
model_cv <- train(expected_survival ~ ., 
                  data = train_pca,
                  method = "lm",
                  trControl = k_10)

# Print Model Performance
print(model_cv)

Results

  • Principal component analysis was performed using a singular value decomposition approach.
  • PC1 captures 40.80% of the variance in the data.
    • PC1 and PC2 capture 50.27% of the variance.
      • The first four PCs capture 67.66% of the variance, or just over two-thirds.
  • After the fourth PC, the variance captured by each successive PC begins to diminish relative to PCs one through four.
    • The first ten PCs capture 88.67% of the variance.
      • Over 90% of the information in the dataset can be explained by the first eleven PCs.
  • The variables which contribute the most to PC1 are
    • expected_hospital_readmission
    • expected_transfusion
    • expected_hospitalization
  • PC2, which is orthogonal to PC1, has relatively large contributions from the five variables measuring levels of phosphorus.
  • Principal component regression was performed with expected_survival used as the response variable.
  • The estimates and significance of each PC regressor demonstrate the differences between the variance captured from the data and usefulness in a linear model.
    • For example, PC4 is a significant regressor despite capturing less variance than PC3 in the training data.
  • Both models produced an \(R^2\) above 96% and a predicted \(R^2\) above 95%, with roughly a 1% advantage for the cross-validation model.

Discussion & Conclusion

  • Summary: PCA is an unsupervised learning technique for dimensionality reduction and data visualization.
  • Key Takeaways: Understand eigenvectors, eigenvalues, and explained variance.

Questions and Comments

References

  1. M. Ringnér, “What is principal component analysis?” Nature biotechnology, vol. 26, no. 3, pp. 303–304, 2008.
  2. I. T. Jolliffe and J. Cadima, “Principal component analysis: A review and recent developments,” Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, p. 20150202, 2016.
  3. B. M. S. Hasan and A. M. Abdulazeez, “A review of principal component analysis algorithm for dimensionality reduction,” Journal of Soft Computing and Data Mining, vol. 2, no. 1, pp. 20–30, 2021.
  4. B. Everitt and T. Hothorn, An introduction to applied multivariate analysis with R. Springer Science & Business Media, 2011.
  5. M. Greenacre, P. J. Groenen, T. Hastie, A. I. d’Enza, A. Markos, and E. Tuzhilina, “Principal component analysis,” Nature Reviews Methods Primers, vol. 2, no. 1, p. 100, 2022.
  6. K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin philosophical magazine and journal of science, vol. 2, no. 11, pp. 559–572, 1901.
  7. R. A. Fisher and W. A. Mackenzie, “Studies in crop variation. II. The manurial response of different potato varieties,” The Journal of Agricultural Science, vol. 13, no. 3, pp. 311–320, 1923.
  8. H. Hotelling, “Analysis of a complex of statistical variables into principal components.” Journal of educational psychology, vol. 24, no. 6, p. 417, 1933.
  9. D. Esposito and F. Esposito, Introducing machine learning. Microsoft Press, 2020.
  10. M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
  11. S. Zhang and M. Turk, “Eigenfaces,” Scholarpedia, vol. 3, no. 9, p. 4244, 2008.
  12. F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  13. J. Maindonald and J. Braun, Data analysis and graphics using R: An example-based approach, vol. 10. Cambridge University Press, 2006.
  14. J. Lever, M. Krzywinski, and N. Altman, “Points of significance: Principal component analysis,” Nature methods, vol. 14, no. 7, pp. 641–643, 2017.
  15. F. L. Gewers et al., “Principal component analysis: A natural approach to data exploration,” ACM Computing Surveys (CSUR), vol. 54, no. 4, pp. 1–34, 2021.
  16. J. Hopcroft and R. Kannan, Foundations of data science. 2014.
  17. “Quarterly dialysis facility care compare (QDFCC) report: July 2023.” Centers for Medicare & Medicaid Services (CMS). Available: https://data.cms.gov/provider-data/dataset/2fpu-cgbb. [Accessed: Oct. 11, 2023]
  18. R Core Team, “prcomp, a function of R: A language and environment for statistical computing.” R Foundation for Statistical Computing, Vienna, Austria, 2023. Available: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp. [Accessed: Oct. 16, 2023]
  19. S. R. Bennett, “Linear algebra for data science.” 2021. Available: https://shainarace.github.io/LinearAlgebra/index.html. [Accessed: Oct. 16, 2023]
  20. D. G. Luenberger, Optimization by vector space methods. John Wiley & Sons, 1997.
  21. S. Nash Warwick and W. Ford, “Abalone.” UCI Machine Learning Repository, 1995.
  22. J. Pagès, Multiple factor analysis by example using R. CRC Press, 2014.
  23. E. K. CS, “PCA problem / how to compute principal components / KTU machine learning.” YouTube, 2020. Available: https://youtu.be/MLaJbA82nzk. [Accessed: Nov. 01, 2023]
  24. F. Chumney, “PCA, EFA, CFA,” pp. 2–3, 6, Sep., 2012, Available: https://www.westga.edu/academics/research/vrc/assets/docs/PCA-EFA-CFA_EssayChumney_09282012.pdf
  25. H. Abdi and L. J. Williams, “Principal component analysis,” WIREs Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010, doi: https://doi.org/10.1002/wics.101. Available: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wics.101
  26. R Core Team, “lm: Fitting linear models.” R Foundation for Statistical Computing, Vienna, Austria, 2023. Available: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm. [Accessed: Nov. 08, 2023]
  27. M. Kuhn, “Building predictive models in R using the caret package,” Journal of Statistical Software, vol. 28, no. 5, pp. 1–26, 2008, doi: 10.18637/jss.v028.i05. Available: https://www.jstatsoft.org/index.php/jss/article/view/v028i05
  28. R. Bro and A. K. Smilde, “Principal component analysis,” Analytical methods, vol. 6, no. 9, pp. 2812–2831, 2014.